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This paper concerns the use of sequential Monte Carlo methods (SMC) for smoothing in general 
state space models. A well-known problem when applying the standard SMC technique in the 
smoothing mode is that the resampling mechanism introduces degeneracy of the approximation 
in the path space. However, when performing maximum likelihood estimation via the EM algo- 
rithm, all functionals involved are of additive form for a large subclass of models. To cope with 
the problem in this case, a modification of the standard method (based on a technique proposed 
by Kitagawa and Sato) is suggested. Our algorithm relies on forgetting properties of the filtering 
dynamics and the quality of the estimates produced is investigated, both theoretically and via 
simulations. 
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1. Introduction 

In this paper, we study SMC methods for smoothing in nonlinear state space models. We 
consider a bivariate process (X,Y), where X = {Xf~;k > 0} is a homogeneous discrete- 
time Markov chain taking values in some state space (X, X). We let (Qg,9 e8C M. d ) and 
v denote the Markov transition kernel and the initial distribution of X, respectively. The 
family {Qe(x, ■); igX, 9 G 0} is assumed to be dominated by the probability measure 
H on (X, X) and we denote by qg(x, ■) the corresponding Radon-Nikodym derivatives. In 
this framework, X is not observed and measurements must be made through the process 
Y = {Yfc; k > 0} taking values in some measurable space (Y,y). These observed variables 
are conditionally independent, given the sequence {Xk]k> 0}, and the conditional dis- 
tribution of Yj; depends only on X^. We denote by Qk the cr-algebra generated by the 
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observed process from time zero to k. Furthermore, there exist, for all x £ X and 9 € 0, 
a density function y i— > gg(x,y) and a measure A on (Y,3^) such that, for k > 0, 



Fe(Y k eA\X k = x) 



ge(x,y)X(dy) 



for all AGy. 



Here, we have written p# for the law of the bivariate Markov chain {(X^, Yj.); k > 0} under 
the model parameterized by £ and we denote by Eg the associated expectation. 

For i <j, let Xi y j = (Xj, . . . ,-Xj-); similar vector notation will be used for other quan- 
tities. In many situations, it is required to compute expectation values of the form 
Ee[t n (Xo :n )|^ n ], where t n is a real-valued, measurable function. In this paper, we fo- 
cus on the case where t n is an additive functional given by 



t n (X0:n) = X! s k( x k:k+l), 



(1) 



fc=0 



where {sk', k > 0} is a sequence of measurable functions (which may depend on the ob- 
served values Yo- n ). 

As an example of when smoothing of such additive functionals is important, consider 
the case of maximum likelihood estimation via the EM algorithm. Having an initial 
estimate 9' of the parameter vector available, an improved estimate is obtained (we refer 
to Cappe et al. [2], Section 10.2.3) by means of computation and maximization of Q(9; 9') 
with respect to 9, where Q(9;9') is defined by 



Q{9;9 , )^E 6 



^2logqe(X k ,X k+1 ) 



k=0 



^ log ge(X k ,Y k ) 



k=0 



E e ,[logv(X )\g n 



This procedure is recursively repeated in order to obtain convergence to a stationary point 
9+ of the log-likelihood function t v ^ n {9; Yo:n) — logL I/j „(6'; Yo :n ), where, for yo :n £ Y™ +1 , 



^v,n{9] yo.r 



X n+1 



ge(xo,yo)v(xa) q e {xk-i,x k )ge{xk,yk)^® i - n+1) {dxo-.n). 



k=l 



The computation of smoothed sum functionals of the above form will also be the key 
issue when considering direct maximum likelihood estimation via the score function 

y 0^u^n\"'i V0:n 

); again see Cappe et al. ([2], Section 10.2.3) for details. 

By applying Bayes' formula, it is straightforward to derive recursive formulas for ex- 
pectations of the additive type discussed above. However, tractable closed form solutions 
are available only if the state space X is finite or the model is linear and Gaussian. 

SMC methods (also known as particle filtering methods) constitute a class of algo- 
rithms that are well suited for providing approximate solutions of the smoothing and 
filtering recursions. In recent years, SMC methods have been applied, sometimes very 
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successfully, in many different fields (see Doucet et al. [6] and Ristic et al. [14] and the 
references therein). A well-known problem when applying SMC methods to sample the 
joint smoothing distribution is that the resampling mechanism of the particle filter intro- 
duces degeneracy of the particle trajectories. Doucet et al. [7] suggest a procedure where 
this is avoided through an additional resampling pass in the time-reversed direction. The 
resulting algorithm is well suited to sample from the joint smoothing distribution, but 
appears unnecessarily complex, computationally, for approximating additive smoothing 
functionals of the form (1). 

In this paper, we study an SMC technique to smooth additive functionals based on 
a fixed-lag smoother presented by Kitagawa and Sato [11]. The method exploits the 
forgetting properties on the conditional hidden chain and is not affected by the degeneracy 
of the particle trajectories. Compared to Doucet et al. [7], computational requirements 
are marginal. Furthermore, we perform, under suitable regularity assumptions on the 
latent chain, a theoretical analysis of the behavior of the estimates obtained. It turns out 
that the L p error and bias are upper bounded by quantities proportional to n\ogn/\/~N 
and n\ogn/N, respectively, where N denotes the number of particles and n the number 
of observations. 

In comparison, applying the results of Del Moral and Doucet ([4], Theorem 4) to a 
functional of type (1) provides a bound proportional to n 2 /yN on the L p error for the 
standard trajectory-based particle smoother. Finally, we apply, for a noisily observed 
autoregressive model and the stochastic volatility model proposed by Hull and White [9] , 
the technique to the Monte Carlo EM (MCEM) algorithm (Wei and Tanner [15]). 



2. Particle approximation of additive functionals 
2.1. The smoothing recursion 

The joint smoothing distribution </v.o : n|n is the probability measure defined, for A € 

Af «»(n+l) ) by 



Under the assumptions above, the joint smoothing distribution has a density (for which 
we will use the same symbol) with respect to fj®( n+1 > satisfying, for all yo-.k+i € Y fc+2 , 
the recursion 



For notational conciseness, we will omit the explicit dependence on the observations from 
the notation for the smoothing measure and replace 0i/,o:fc|fe[2/O:fe](s 0) by <t>u,o-.k\k{,'\ 

Particle filtering, in its most basic form, consists of approximating the exact smoothing 
relations by propagating particle trajectories in the state space of the hidden chain. Given 



<f> v ,0:n\nlYo:n\(A;9)±Fe(X 0:n eA\g n ). 



0i/,o:jfe+i|*:+i luo-.k+i] (xo-.k+i ; 0) 



qe(x k ,x k+1 )g g {x k+1 ,y k+1 )(j)v$, k \ k [y 0:k ]{x 0:k -,9). 



(2) 



L,y,fc+i(#;yo:fc+i) 
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a fixed sequence of observations, this is done according to the following scheme. In order 
to keep the notation simple, we fix the model parameters and omit 6 from the notation 
throughout this part. 

At time zero, N random variables {^q' 1 ', 1 < i < N} are drawn from a common prob- 
ability measure c; such that v <C These initial particles are assigned the importance 
weights uJq' 1 = Wo(^q' 1 ), 1 < i < N, where, for x £ X, Wo(x) = g(x, yo) dv/d<;(x), pro- 
viding Y2i=i UJ o' % f(€o' % )/Yli=x t * J o ,% as an importance sampling estimate of 4> U fi\of for 
/ £ i3b(X). Henceforth, the particle paths £ .,£ = [£ O :m(0)j ■ ■ • >£o:m( m )]i 1 - * - N i are 
recursively updated according to the following procedure. 

At time k, let {(£,q.£, 1 < i < N} be a set of weighted particles approximat- 
ing <f>v,o-.k\k, in the sense that Ei=i w f'V(^o?fe ; with 0^ = EiIi w /f' 1 an d /£ 
2?b(X fc+1 ), is an estimate of the expectation 4> U fi-.k\kf- Then, an updated weighted sam- 
ple 1 < * < A}, approximating the distribution 0„.o : fc+i|fc+i ; i s obtained 
by, first, simulating £ -fc+i ~ ^fc(C(>fci'): where the kernel i?£ is of type R\(xQ-k, f) = 
J x f(xo-.k,Xk+i)Rk(xk,dxk+i), with / £ £>b(X fc+2 ) and each Rk being a Markov transi- 
tion kernel. The new particles are simulated independently of each other and the special 
form of R^ implies that past particle trajectories are kept unchanged throughout this 
mutation step. A popular choice is to set Rk = Q, yielding the so-called bootstrap fil- 
ter; more sophisticated techniques involve proposals depending on the observed values 
(see Example 4.2). Second, when the observation Yfc+i = y k +i is available, the impor- 
tance weights are updated according to = w k ,% Wk+i [£o-k+i(k '■ ^ + !)]> where, for 
{x,x')eX 2 , W k (x,x') ±g(x',y k )dQ(x,-)/dR k -i(x,-)(x'). Now, for / £ S b (X fc + 2 ), the 
self-normalized estimate <t>vo-k+x\k+if ~ ^j=i P rov ides an approxi- 
mation Of 0„ jO :k+l|k+l- 

To prevent degeneracy, a resampling mechanism is introduced. In its simpler form, 
resampling amounts to drawing, conditionally independently, indices ij^' 1 , ■ ■ ■ >I k ' N f r °ni 
the set {1,...,N}, multinomially with respect to the normalized weights u^' 3 '/ZlP , 
1 < j < N. Now, a new equally weighted sample {i^.'k', 1 < i < A} is constructed by 
setting t^Q.j? = £,Q.'k h ■ After the resampling procedure, the weights are all reset as 
' l = 1/A, yielding another estimate, ^ fi , k \ k f = Ej=i f(£o : k)/ N > of <t>v,0:k\k- Note that 
the resampling mechanism might modify the whole trajectory of a certain particle, im- 
plying that, in general, for m < n, £(>n( TO ) £o-n+i( m )- The multinomial resampling 
method is not the only conceivable way to carry out the selection step (see e.g. Doucet 
et al. [6]). 

Using the weighted samples {(^' k , w£' J ); 1 < j < A}, < k < n, produced under the 
parameter 8 £ G, an approximation of je.n = ^e[t n (Xo :n )\Gn] is obtained by constructing 
the estimators 



1 



JV 



3=1 



or 



7«L 



JV 



0:ri 



(3) 
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Figure 1. Typical particle trajectories for N — 50; see Section 4 for details regarding model 
and algorithm. 



When the functional {t„} has the form given in (1), it is straightforward to verify 
that recording all of the particle trajectories is indeed not required to evaluate (3): upon 
defining t k ' 1 = tk(^Q. k ) 7 we have, for k > 1, 



N,i _ j + Sk KoTfc+i ( k '■ k + !)] ) if 110 resampling occurs, 

[t k + + s k [t; . k+l {k : k + l)J, if resampling occurs. 



The recursion is initialized by t^' 1 = ii(£(>V)- In accordance with (3), is obtained as 
£i=i ^n'^n^/^n ■ Hence, for each particle we need only record its current position 
£ .k (fc), weight tu k ^ and associated functional value t k '*. Thus, the method necessitates 
only minor adaptations once the particle filter has been implemented. 

As illustrated in Figure 1, as n increases, the path trajectories system collapses, and 
the estimators (3) are not reliable for sensible N values (see Doucet et al. [6], Kitagawa 
and Sato [11] and Andrieu and Doucet [1] for a discussion). 

To cope with this drawback, we suggest the following method, based on a technique 
proposed by Kitagawa and Sato [11]. By the forgetting property of the time-reversed 
conditional hidden chain (Theorem 3.1), we expect that, for a large enough integer A„ < 
n — k, 



^e[sk{X k :k+l)\Gn] ~ ^e[sk(X k :k+l)\Gk+A n ], 



(5) 
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yielding, with k(A n ) = (k + A„) A n, 



-fe, n = Eg 



^ Sk(Xk:k+l] 



k=0 



n-1 



^2Eg[s k (X k: k + l)\Gk(A n )}- 



k=0 



The above relation suggests that waiting for all of the trajectories to collapse - as (4) 
implies - is not convenient. Instead, when the particle population N is sufficiently large 
so that (5) is valid for a lag A„ which may be far smaller than the typical collapsing 
time, one should apply the two approximations 



n-1 N t^N,] 
*^E# l <A.)(*:Hl)l 1 



fe=0 3=1 "*(A„) 
i-l N 



le,n = 



(6) 
(7) 



fc=0 3 = 1 



of 7e. n- Although somewhat more involved than the standard approximation (3), the 
above lag-based approximation may be updated recursively by recording the recent his- 
tory of the particles as well as the accumulated contribution of terms that will no longer 
get updated. Thus, apart from increased storage requirements, computing the lag-based 
approximation 7^' A ™ is clearly not, from a computational point of view, more demanding 
than computing n . 



3. Theoretical evaluation of the fixed-lag technique 

To accomplish the robustification above, we need to specify the lag A„ and how this lag 
should depend on n. This is done by examining the quality of the estimates produced by 
the algorithm in terms of bias and L p error. Of particular interest is how these errors are 
affected by the lag and whether it makes their dependence on n and N more favorable 
in comparison with the standard trajectory-based approach. 

The validity of 5 is based on the assumption that the conditional hidden chains - 
in the forward as well as the backward directions - have forgetting properties, that is, 
the distributions of two versions of each chain starting at different initial distributions 
approach each other as time increases. This property depends on the following uniform 
ergodicity conditions on the model, which imply that forgetting occurs at a geometrical 
rate: 

(Al) (i) cr_ = m£ ee &iiif XtX , e xqe(x,x') > 0, a + = sup„ e0 sup x x , eX q g (x,x') < oo; 
(ii) for all y g Y, sup eee \\g e (-, y)\\x,oo < oo, inf eee J x 9e(x,y)fi(dx) > 0. 

Under (Al), we define 
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We now define the Markov transition kernels that generate the conditional hid- 
den chains. For any two transition kernels K and T from (Ei,£i) to (£2,^2) and 
(E 2 ,(?2) to (E 3 ,£ 3 ), respectively, we define the product transition kernel by KT(x,A) = 
j E T(z,A)K(x,dz) for £ G Ei and Ae£ 3 . 

Introduce, for / G £>b(X fc+2 ), xo :k G X fc+1 and yk+i G Y, the unnormalizcd path- 
wise transition kernel L k (x 0:k , f;9) = f x f(xo:k+i)ge(xk+i,yk+i)Qe(xk,dx k +i). Assump- 
tion (Al) makes this integral well defined for all k > 0. We will often consider composi- 
tions 



r. m 

■■■L m (x ; k ,f;9)= / f{xo;m+i)T\[ge(zi+i,yi+i)Qe{xi,dx i+ i 



with / G 6 b (X m+2 ), x 0:k G X fe+1 and j/ 0: fc € Y m - fc+1 , and it is clear that, for all k < m, 
the function L k ■ ■ ■ L m (xQ [k , X m+2 ; #) depends only on x k . Thus, a version of this function 
comprising only the last component is well defined and we write L k ■ ■ ■ L m (x k ,X m+2 ;9) 
in this case. For k > m, we set L k ■ ■ ■ L m = Id. Using this notation and given n > 0, 
the forward smoothing kernels given by, for k > 0, x k G X and A G X, F k \ n (x k: A;9) = 
¥e(Xk+i G A\X k = Xk,Gn)i can, for indices < k < n and y k +i G Y, be written as 

F k \ n (x k ,A;0) 

(2) 

ge{xk+i,yk+i)L k +i ■ ■ ■ L n ^i(x k+ i,X n+1 ;9)Q e (x kl dx k+1 ) 

L k ---L n _ 1 (x k ,X»+ 1 ;9) 
for < k < n, 
Qg(x k ,A), for k>n. 

Analogously, for the time-reversed conditional hidden chain, we consider the backward 
smoothing kernels defined by B v u n (x k +i, A; 6) = fg(X k G A\X k+ i = x k+ i,Q n ), where 
k > 0, x k +i G X and A G X. Note that B vk \ n depends on the initial distribution of the 
latent chain. The backward kernel can be expressed as 



(x k+1 ,A;9) 

( J A qe{x k ,x k+1 )(j)^ k (dx k ;9) 
_ I Ix1s(x' k ,x k+1 )(j)^. k (dx' k ;9) ' 

I A Jx I (f* ' X k+l) a e~ n ( X ™ ! x k)(f>v,n (dx n j 6)fl(dx k ) 

J x q k g - n+1 (x' n ,x k+1 )^ n (dx' n ;9) 

where, for m > 1 , qf 1 denotes the density of the m-step kernel Q\ 

The following theorem (see Del Moral [3], page 143), stating geometrical ergodicity 
of the forward and backward chains, is instrumental for the developments which are to 
follow. 



for < k < n, 
for k > n, 
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Theorem 3.1. Assume (Al) and let p be defined in (1). Then, for all k > m > 0, all 
9 £ O, all probability measures v\, vi on X and all yo-.n G Y" + , 

lkiF m |„ • • -F fc |„(-;0) - v 2 F m \ n ■ ■ •F k \ n (-;6)\\ T y < p ~ m+1 , 

\Wi^u,k\n • • -B„ jm | n (-;0) - iAjBj, jfc | n ■ ■ ■B„ m \ n (-;6)\\ T y < p ~ m+1 . 

Assumption (Al) typically requires that X is a compact set, but some very recent 
papers (Douc et al. [5], Kleptsyna and Veretennikov [12]) provide results that estab- 
lish geometric forgetting under considerably weaker assumptions. Applying these results 
within our framework would, however, make the analysis far more complicated since the 
provided bounds are uniform neither in the observations nor the initial distributions. 



3.1. Main results 

For the sake of simplicity, let us assume that multinomial resampling is applied at every 
iteration. Moreover, let the observations used by the particle filter be generated by a 
state space model with kernel, measurement density and initial distribution Q, g and P, 
respectively. We stress that Q and g are not assumed to belong to the parametric family 
{(Qe, ge)',G € 0}. Using these observed values as input, the evolution of the particle 
cloud follows the usual dynamics (Qe^ge-,^, € 0) and, in this setting, it is easily verified 
that the process {Zk]k > 0}, with Zk — [£,^k(^ ~ 1 • ■ • • ' ^offe^C^ — 1 : fc), Xfc, Y^], is 
a Markov chain on X 2N+1 x Y. We denote by and E^ the law of this chain and 
the associated expectation, respectively, and define the filtration {T^ '; k > 0} by F k N +1 = 

•^ Vcr (£o:fe+i> ■ ■ ■ >£o:k+x) witn — (7 (^o' 1 ^ ■ • -i£o' N )- Tne marginal of P^ with respect 
to {(Xk, Yk); k > 0} and the associated expectation are denoted by P and E, respectively. 
For any integer p > 1, random variable V € L P (P^ ) and sub-tr-algebra A C a({Zk] k > 0}) 
we define the conditional norm \\V\\ p{A = (E g y [\V\ p \A]) 1/p . 

(A2) Forallfc>0, 6 e and y k £ Y, \\W k (-;9)\\ X 2 :00 < oo. 

Remark 3.2. In case of the bootstrap particle filter, for which Rg ^ = Qe, assumption 
(A2) is implied by assumption (Al). The same is true for the so-called optimal kernel 
used in Example 4.2. 

Theorem 3.3. Assume (Al) and (A2). There then exist universal constants B p and B, 
Bp depending only on p, such that the following holds true for all n > 0, 9 € 0, A n > 
and N > 1 ; 

(i) for all p>2, 

||7e, n -le,n\\p\g n 

n-A n 

<2p A " J2 Pfcllx-.oo 

k=0 
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B 



n-l 



k=0 



± k y ||^ m (-;g)|l x2 ,ooP OV(fc -" l) 



\w (-,e)\\ X!OoP k 



(ii) 



n-A„ 

<2p A " INI* 2 . 



AT(l-p)' 



n-l 

■ 5l ll S fe||x2,00 



1_ \\W m (-;0)\\ 2 x ^ oo P°^ k ' m '> 

2 2^ 



m— 1 



{»(^)} 2 



For the purpose of illustrating these bounds, assume that we are given a set 
{yk',k > 0} of fixed observations and that all ||sfc||x 2 ,oo> as weu as all fractions 
\\Wk(-;0)\\x2 JX ,/figg(yk), are uniformly bounded in fc. We then conclude that increas- 
ing the lag with n as A„ = [clogn] , c > — 1/ logp, will imply that np An tends to zero as 
n goes to infinity, leading to an error which is dominated by the variability due to the 
particle filter (the second term of the bound in Theorem 3.3(i)) and upper bounded by 
a quantity proportional to 



E 

k=0 



'k-\- \c log n~\ 

E . 

m— 1 



OV(fe-m) 



< 



1 



\fN\l-p 



1 + \c\ogn\ , 



that is, of order nlogn/V N . Note the dependence on the mixing coefficient p of this 
rate. In contrast, setting A„ = n, that is, using the direct full-path approximation, would 
result in a stochastic error which is upper bounded by a quantity proportional to n 2 j \f~N . 



3.2. Extension to randomly varying observations 

As mentioned, all results presented above concern smoothing distribution approxima- 
tions produced by the particle filter algorithm conditionally on a given sequence of ob- 
servations. In this section, we extend these results to the case of a randomly varying 
observation sequence. 

For the bounds presented in Theorem 3.3, the conditioning on Q n can be removed by 
introducing additional model assumptions. In the following, we suppose that v <C p and 
that the resulting Radon-Nikodym derivative satisfies {dv/dp)- = infa; g xdi//d/i(a;) > 0. 
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(A3) Let t n be given by (1). For p > 2, t > 1 and 6* € 0, there exists a constant 
dp,i(tn', 9) G K + such that 



max< E 



IM^IILJWI 



x 2 . 



{H9e(Yk)}*> 



,E[||*|| 



» MX" 



];0<fc<n,0<i<n-l 



< a P ,i(t n ', 9). 



Proposition 3.4. Assume (Al) and (A2). There then exist universal constants B p and 
B, Bp depending only on p, such that the following holds true for all N > 1; 

(i) if assumption (A3) is satisfied for £ = p > 2 and 9 € 0, then 

\\le,n" - TmIIp < Kiv^ *) A(n - A n + 1) 
, B p a p {g(t n ;6) ( A n (n + 1) 



(di,/d M ) 2 



VJV(1 - p) 

(ii) if assumption (A3) is satisfied for p = 2, £ = 1 and # G 0, i/ien 
\®ellek An ~ 7e,n}\ < 2a 2A (t n ;9)p A "(n - A„ + 1) 



Ba 2A {t n ;9) J A»(n + 1) 
A(l-p) 2 \ a 2 . 

The proof of this result is given in Section A. 2. 



1 



e^(l-p) (d^) 2 _ 



Remark 3.5. In the case of a compact state space X, assumption (A3) implies only 
limited additional restrictions on the state space model. In fact, for a large class of 
models, assumption (A3) follows as a direct consequence of assumption (Al). 



4. Applications to maximum likelihood estimation 

We now return to the computation of the maximum likelihood estimator. In the follow- 
ing, we consider models for which the set of complete data likelihood functions is an 
exponential family, that is, for all 6 € and n > 0, the joint density of (Xo :n ,Yo :n ) is of 
the form exp[(ip(0),S n (xo :n )) — c(9)]h(xo :n ) . Here, ip and the sufficient statistics S n are 
Revalued functions on and X™ +1 , respectively, c is a real- valued function on 9 and h 
is a real- valued non- negative function on X™ +1 . By (•, •) we denote the scalar product. All 
of these functions may depend on the observed values yo :n > even though this is expunged 
from the notation. 

If the complete data likelihood function is of the particular form above and the expecta- 
tion (j)i>,0:n\n(S n ; 9) is finite for all 9 £ 0, then the intermediate quantity of EM can be writ- 
ten as (up to quantities which do not depend on 9) Q(8;8') = (ip(9),(f) u ^. n \ n (S n ;9')} — c(9). 
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Note, finally that, as mentioned in the Introduction, a typical element S„ im (io ;n ), 1 < 
m<d s , of the vector S n (xo:n) is an additive functional S„ im (xo:n) = J2k 
so that 4> U fi:n\n(Sn',9') can be estimated using either (6) or (7). Denoting by S n such an 
estimate, we may approximate the intermediate quantity by 

Q N {e-e') = (mX)'c{e). 

In the next step - referred to as the M-step - Q N (8;9') is maximized with respect to 
9, providing a new parameter estimate. This procedure is repeated recursively given an 
initial guess 9q- 

As an illustration, we consider the problem of inference in a noisily observed AR(1) 
model and the stochastic volatility (SV) model. None of these examples satisfy assump- 
tion (Al); however, geometric ergodicity for the models in question can be established 
using bounds presented by Douc et al. [5]. Although these bounds are somewhat more 
involved than those presented in Theorem 3.1 (e.g., the former depend on the initial 
distributions and the observations), we may, nevertheless, expect that the conclusion 
reached in Section 3, that is, that the error of the fixed-lag approximation is controlled 
by a lag of order logn, still applies. The situation is complicated, however, by the fact 
that the mixing rates depend on the observations and are uniform only under the expec- 
tation operator. In other words, there may be occasional outcomes for which mixing is 
poor, even if the average performance of the system is satisfactory. 

Example 1^.1 (SMCEM for noisily observed AR(1) model). We consider the 
state space model 

Xk+i = a Xk + VwWk+i, 
Y k =X k + a v V k 

with {Wk',k > 1} and {Vk',k > 0} being mutually independent sets of standard normal 
distributed variables such that Wk+i is independent of (Xj,Yj), < i < k, and V4 is 
independent of Xk, (Xi,Yi), < i < fc — 1. The initial distribution is chosen to be a diffuse 
prior so that </>„,o|o 1S N{yo,cr%). Throughout the experiment, we use a fixed sequence 
of observations produced by simulation under the parameters a* = 0.98, c* = 0.2 and 
<7* = 1. In this case, ip(6) = [l/2er^, — a/crj, a 2 /(2a^), l/(2a 2 v )] and the components of 
the M 4 -valued function x 0: „ h-> S n (x Q:n ) are given by S n s(x 0:n ) = Y,k=i x l> S ny2 (xQ : n) — 

J2k=o x k%k+i, S ni3 (xo:n) = Y,k=o x k and S nA x 0:n) =XTfe=o(yfc ~ Xk ) 2 - Furthermore, up 
to terms not depending on parameters, c(0) — nlog(of u )/2 + (n + 1) log(cr 2 )/2. In this 
setting, one step of the MCEM algorithm is carried out in the following way. Having 
produced an estimate Q l ~ x of the parameters 9 = (a,cr^,cr 2 ) at the previous iteration, 
we compute an approximation S n — (S n ,i, S n ,2, S n> 3, S nt i) of <j> v ,0:n\n(Sn] 9 l ~ 1 ) using the 
particle filter and update the parameters according to 
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n=lOO 
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l 2 i ■! : S " 
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Figure 2. Boxplots of estimates of <f)v,0:n\nSn,i/n, produced with the fixed-lag technique, for 
the noisily observed AR(1) model in Example 4.1. 



We simulated, for each n = 100, 1000, 10,000 observations, 1000 SMC estimates of 
4 ) v,0:n\n.Si using the fixed-lag smoothing technique for the parameter values a = 0.8, a w = 
0.5 and a v = 2. Here, the standard bootstrap particle filter with systematic resampling 
was used, with Rk = Q for all k > 0. The dotted lines indicate the exact expected values, 
obtained by means of disturbance smoothing. To study the bias-variance trade-off - 
discussed in detail in the previous section - of the method, we used six different lags 
for each n and a constant particle population size N = 1000. The result is displayed in 
Figure 2, from which it is evident that the bias is controlled for a size of lag that increases 
approximately logarithmically with n. In particular, from the plot, we deduce that an 
optimal outcome is gained when lags of size 2 4 , 2 4 and 2 5 are used for n being 100, 1000 
and 10,000, respectively 

When the lag is sufficiently large so that we can ignore the term of the bias which is 
deduced from forgetting arguments (being roughly of magnitude np An ), increasing the 
lag further exclusively leads to an increase of variance, as well as bias, of the estimates; 
compare the two last boxes of each plot. This is completely in accordance with the 
theoretical results of Section 2. Note that the scale on the y-axis is the same for the three 
panels, although the y-axis has been shifted in each panel due to the fact that the value 
of the normalized smoothed statistic evolves as the number of observations increases. 

In Figure 3, we again report the cases n = 100, 1000, 10,000 observations and compare 
the basic approximation strategy (4) with the one based on fixed-lag smoothing with 
suitable lags. Guided by the plots of Figure 2 and the theory developed in the previous 
section, we choose the lags 2 4 , 2 4 and 2 5 , respectively. The number of particles was set 
to 1000 for all n. It is obvious that fixed-lag smoothing drastically reduces the variance 
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Figure 3. Boxplots of estimates of cf> u ,0:n\nSn,l/ n j produced by means of both the fixed-lag 
technique and standard trajectory-based smoothing, for the noisily observed AR(1) model in 
Example 4.1. Each box is based on 200 estimates, and the size of the particle population was 
N = 1000 for all cases. 



without significantly raising the bias. As in the previous figure, dotted lines indicate ex- 
act values. As expected, the bias of the two techniques increases with n since the number 
of particles is held constant. 

Example 4-2 (SMCEM for the stochastic volatility (SV) model). In the discrete- 
time case, the canonical version of the SV model (Hull and White [9], Jacquier et al. [10]) 
is given by the two relations 

Xk+i = aX k +<T€k+x, 
Y k = [3cxp(X k /2)e k 

with {e k ',k > 1} and {s k ]k > 0} being mutually independent sets of standard normal 
distributed variables such that W k +\ is independent of (Xi,Yi), < i < k, and V k is 
independent of X k , (Xi,Yi), 0< i< k— 1. 

To use the SV model in practice, we need to estimate the parameters 9 = (/3, a, a). 
Throughout this example, we will use a sequence of data obtained by simulation under 
the parameters (3* = 0.63, a* = 0.975 and a* = 0.16. These parameters are consistent 
with empirical estimates for daily equity return series and are often used in simulation 
studies. In conformity with Example 4.1, we assume that the latent chain is initialized 
by an improper diffuse prior. The SV model is within the scope of exponential fam- 
ilies, with ip(0) = [-a 2 /(2CT 2 ),-l/(2cr 2 ),a/cr 2 ,-l/(2/3 2 )] and components of S n (x .. n ) 
given by S n> i(x : n ) — Sfe=o x l> S n,2( x o--n) = Sfe=i x \i S n ,3( x o--n) — Z)fc=i x kXk-i and 
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Figure 4. Boxplots of estimates of <j)v,0:n\nSn,i/n, produced with the fixed-lag technique, for 
the SV model in Example 4.2. Each box is based on 200 estimates and the size of the particle 
population was set to N = 1000 in all cases. 



S n ,A(xo:n) — Y^ik=oy kex P(~ Xk )- I n addition, up to terms not depending on parameters, 
c(6) = (7i + 1) log(/^ 2 )/2 + (7i + 1) log(a 2 )/2. 

Let S n = (S n ,i, S n ,2, S n ,3, Sjia) be a particle approximation of <p v fl: n \ n (S„;9 l ). To 
apply the Monte Carlo EM algorithm to the SV model is not more involved than for 
the autoregressive model in Example 4.1. In fact, the updating formulas appear to be 
completely analogous: 

0?T— (£, 2 -a^„, 3 ), (^) 2 = %. 

71 71 + 1 

As proposal kernel R^, we use an approximation, used by Cappe et al. ([2], Exam- 
ple 7.2.5) and inspired by Pitt and Shcpard [13], of the so-called optimal kernel, that is, 
the conditional density of X^+i given both and Y/c+i. 

We repeat the numerical investigations of Example 4.1. The resulting approximation 
of 4>v,0:n\nS n ,i, displayed in Figure 4, behaves similarly. Here, again, we observe that 
moderate values of the lag A are sufficient to suppress the bias. 

We finally compare the SMCEM parameter estimates obtained with the fixed-lag 
approximation and with the standard trajectory-based approximation on a simulated 
dataset of length n = 5000. Note that for the SMCEM procedure to converge to the MLE, 
it is necessary to increase the number of simulations that are performed as we progress 
through the EM iterations. We follow the recommendation of Fort and Moulines [8] and 
start by running 150 iterations of the Monte Carlo EM procedure with the number of 
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50 100 150 200 £50 0.15 0.17 0.19 

Number of Iterations 



Figure 5. SMCEM parameter estimates of f3, a and a from n = 5000 observations using the 
standard trajectory-based smoothing approximation. Each plot overlays 50 realizations of the 
particle simulations; the histograms pertain to the final (250th) SMCEM iteration. 




Number of Iterations 



Figure 6. SMCEM parameter estimates of j3, a and a from n = 5000 observations using the 
fixed-lag smoothing approximation with A = 40. Each plot overlays 50 realizations of the particle 
simulations; the histograms pertain to the final (250th) SMCEM iteration. 
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particles set at N = 100. For the subsequent 100 iterations, the number of particles in- 
creases at a quadratic rate with a final value (for the 250th Monte Carlo EM iteration) 
equal to N = 1600. The cumulative number of simulations performed during the 250 SM- 
CEM iterations is equal to 75,000 (times the length of the observation sequence), which 
is quite moderate for a Monte Carlo-based optimization method. In Figures 5 and 6, we 
display the superimposed trajectories of parameter estimates for 50 realizations of the 
particles, together with histograms of the final estimates (at iteration 250) when using, 
respectively, the trajectory-based approximation (in Figure 5) and the fixed-lag approxi- 
mation with A = 40 (in Figure 6). Not surprisingly, the fact that the particle simulations 
are iterated for several successive values of the parameter estimates only amplifies the 
differences observed so far. With the fixed-lag approximation, the standard deviation of 
the final SMCEM parameter estimate is divided by a factor of seven for 0, and three for 
a and a, which is quite impressive in the context of Monte Carlo methods: to achieve the 
same accuracy with the trajectory-based approximation, one would need about ten times 
more particles to compensate for the higher simulation variance. Table 1 shows that the 
fixed-lag approximation (third row) indeed remains more reliable than the trajectory- 
based approximation, even when the latter is computed from ten times more particles 
(second row). Note that, for the trajectory-based approximation, multiplying the number 
of particles by ten does not reduce the standard deviation of the estimates as much as 
expected from the asymptotic theory. This is certainly due to the moderate number of 
particles used in the baseline setting, as we start from TV = 100 particles during the first 
SMCEM iterations and terminate with N = 1600. 

Appendix A: Proofs 
A.l. Proof of Theorem 3.3 

The proof of Theorem 3.3 partly comprises the geometric ergodicity of the time-reversed 
conditional hidden chain (Theorem 3.1), partly the next proposition. In the following, 
we omit 8 from the notation for brevity. Moreover, let Ci(X n+1 ) be the set of bounded 



Table 1. Mean and standard deviation of SMCEM parameter estimates at the 250th iteration 
(estimated from 50 independent runs) 



Smoothing algorithm 




13 


a 


a 


Trajectory-based, 




0.5991 


0.9742 


0.1659 


with 75,000 total simulations 


std. 


0.0136 


std. 0.0019 


std. 0.0070 


Trajectory-based, 




0.5990 


0.9739 


0.1666 


with 750,000 total simulations 


std. 


0.0045 


std. 0.0011 


std. 0.0043 


Fixed-lag, 




0.5962 


0.9735 


0.1682 


with 75,000 total simulations 


std. 


0.0019 


std. 0.0006 


std. 0.0024 
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measurable functions / on X" +1 , possibly depending on Yo :n , of type /(xo : „) = f(xi- n ) 
for some function / : X"~ l+1 — ► R. 

Proposition A.l. Assume (Al) and (A2), and let f 6 ^(X" +1 ), < i < n. T/iere i/ien 
exisi universal constants B p and B , B p depending only on p, such that the following 
holds for all N > 1: 

(i) for allp>2, 

\\ t t>v,0:n\nfi ~ 0v,O:n|n/*||p|Sn 



< 



Bp\\fi\\x»-+ 1 , 



N(l-p) 



1 A \\W k \\ X 2 t0oP ^- k ) , Ij^olkocP 1 



-E 



fc=l 



w(ifc) 



(ii) 



|E A [<0m|n/i^n]-^,0:»l«AI 



< 



fl||/il|x" + i,c 

N(l-p) 2 



1 ™ 



fc=l 



{w(n)} s 



To prove Proposition A.l, we need some preparatory lemmas and definitions. In accor- 
dance with the mutation-selection procedure presented in Section 2, we have, for k > 1, 
A £ X ®{k+i) and i € {i j ... j JV}, that 

AT 

~~ f]W /l fc-lVS0:fc-l^V- 

j=l fe-1 

That is, conditional on ^ r ^ 1 , the swarm 1 < « < A^} of mutated particles at time 

fc is obtained by sampling N independent and identically distributed particles from the 
measure 

^ = <0: fc -H*-i*E-r (A- 1 ) 
Using this, define, for Ae X® {k+1 \ 

f*k\M ± J a -^f(x ; k )v£(dx -. k ), (A.2) 
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where the Radon-Nikodym derivative is given by, for xo :k £ X fc+1 , 

d ^k\ n , A W k (x k - 1:k )L k ■ ■ ■ L n . 1 (x .. k ,X n+1 ) 
^r X °- k) < 0:fc _ 1|fe _ 1 L fe - 1 ---L„_ 1 (X« + i) • 

In addition, for A G X, let 

Vo\ n {A) = J ^^(zoMdxo) 

with, for Xq <E X and ^gY, 

W (o: )io---i ra -i^o ) X™+ 1 ) 
(k W ^[ff(-,y )io---i„-i(X"+ 1 )] • 



Lemma A.2. Let f e B b (X n+1 ). Then, for alln>0 and N > 1, 

n 

4>u,Q-. n \nf ~ <t>v,0m\ n f = ^2<Pk (/). 

where, for k>l, 



k=Q 



^„* fc , n [/], (A.3) 



J2f=i dMo|n/d?(^' 



MO|n*0,n[/] 



and t/ie operators ^ k . n : $b(X fe+1 ) — ► $b(X < fc < n + 1, are, for fixed points XQ ]k £ 
X fc+1 , defined by 



Lk ■ ■ ■L n -if(x :k) Lk ■ ■ ■ L n -if(x :k) 



L k ■ ■ ■ L„_i(x 0: fe,X"+ 1 ) L k ■ ■■ L n _ 1 (x 0:k ,X n + 1 ) ' 

Proof. As a starting point, consider the decomposition 

$v,0:n\nf ~ 0i>,O:n|n/ 

^o-^o • • • L n -if 



(A.4) 



• • • in-l(X" 



IjT — <Pvfi:n\nf 



E 

fe=i 



i/.oifc-iife-i^fe- 1 ' ' ' L n -if 



( t ) v,Q:k\k L k--- L n-lf 
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Using the definitions (A.l) and (A. 2) of 77^; and respectively, we may write, for 

k> 1, 

^0:fc-l|fc-l L fc-l ' --Ln-lf 



y,0:fc-l|fe 



_ 1 L k _ 1 ---L n ^ 1 (X»+i) 
W k {-)L k ---L n -xf{-) 



lN 



u,0:fc-l|fc-l i *-l"' i "-l( XB+1 

iy fe (-)i fc ---i„-i(-,X» +1 ) 



I /,0:fe-l|fe-l L fe-l ' "- C 'n-l( Xri 



l T r, 1n ^■■■£n-l/M \ 

TT^*.«^(-)+ Lfc ... Ln _ i(S(j:fciXn+1) | 



ife ' ' , in-l/(»0:fc) 



Lfe---L n _l(xO:fc,X»+ 1 ) 



L k ---L n - 1 (x :k,X n+1 )' 
On the other hand, 

^0:k\k L k--- L n-lf 

_EiIid^„/dr ? f(^)^ )n [/](^ i 



L k ■ ■ ■ L n -lf(xQ:k) 

L k ---L n _ 1 (x 0:k ,X n + 1 ) 



and, by combining the two latter identities, it follows from the definition (A. 3) of tp k (/) 
that, for k > 1, 



^ V (/) 



The identity 



V0:fc fc 



L k - ■ ■ L n _if 



kN 



■,Q:fc-l|fc-l-kfc-l ' --Ln-lf 

%. Mk L k ■ ■ • L„_ 1 (X«+i) ,^-1 • • • L„-i(X»+i) 



„ Lq • • • L n -if 



can be verified in a similar manner. 



□ 



Note that, conditional on T k _ 1 , the first term on the right-hand side of (A. 3) is noth- 
ing but an importance sampling estimate of fJ- k i n ^k,n[f]i based on N independent rj^ - 
distributed variables. 
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Lemma A. 3. Assume (Al) and let, for n > and <i <n, /; G Ci(X n+1 ) . Furthermore, 
let, for k>0, the operator ^k,n be defined via (A. 4). Then, 

ll*fc,n[/i]||x*+i 1 oo<2/> 0V(< -* ) ||/i|| X »+i I oo. 

Proof. For k>i, we bound *S?k,n[fi] from above by 2|| /i||x™+ 1 ,c3oi however, for k < i, a 
geometrically decreasing bound of the function can be obtained by using the forgetting 
property of the conditional latent chain. Hence, by the Markov property of the posterior 
chain and using the definition of the forward kernels (see (2)), 

Lk-'-L n -ifi(xo:k) Tuttrv my n i 

L k ■ ■ •L„_i(X0:fe,X n+i ) 

= E[E[f i (x i .. n )\x i = xi,g n ]\x k = Xk ,g n ] 

= F fc |„ • ■ ■F i _ lln {x k ,E[f i (X i:n )\X i = -,g n ]} 
with xo:fc G X +1 . Therefore, we may, for k <i, rewrite ^k,n[fi]{xo-.k) as 

^k,n[fi](xO:k) 

{F fc |„ ■ ■■F i _ 1 \ n (x k ,dx i ) - F fc |„ ■ ■ ■F i _ 1 \ n (x k ,dxi)}E[f i (X i .. n )\X i = Xi,g n ]. 

c 

Applying Theorem 3.1 to this difference yields 
\^k, n [fi)(x :k)\ 

< 2\\E[fi(X i:n )\Xi = •,^„]||x.oo||F fe | Il ■■•F i _ 1 | n (xjfe,-) -F fc |„ ■ • ■F i _ 1 \ n (x k , ■ ) ! I t v 

< 2\\E[f i (X i:n )\X i = ■,g n ]\\x, ooP i - k < 2||/ t || X n + i, 00 p l - fc . 

□ 

Lemma A. 4. Assume (Al) and let n>0. Then, for all 1 < k < n, XQ- k G X fc+1 , y k G Y 
and N> I, 

d ^k\n, ||W fc || X a,oo 



(a;o:fc) < 



d^f ' M{Vk)Q- P)o-- 
where n^f and H> k i n are defined in (A.l) and (A. 2), respectively. 

Proof. First write, for XQ. k G X fe+1 and y k +\ G Y, 
L k ---L n _ 1 (x 0:k ,X n+1 ) 

= / q(xfe,x fc+ i)X fc+ i • • ■L n -i(x .. k+ i,X n+1 )g(x k+ i,y k+ i)fj,(dx k+1 ) 
Jx 

<cr + L k+1 ■ ■ ■ L n - 1 (x 0:k+1 ,X n+1 )g(x k+1 ,y k+1 )fi(dx k+1 ). (A.5) 
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Now, since the function L k+ i ■ ■ ■ L„_i(-, X™ +1 ) is constant in all but the last component 
of the argument, 

Lk~i ■ ■ •iri-i(^0:fe-i,X rl+1 ) 

= / q(x k -i,Xk)g(x k ,y k ) / q(x k ,x k+1 )L k+1 ■ ■ •L„_i(x :fc+i,X™ +1 ) 
Jx Jx 

x g(x k+ i,y k+1 )n® 2 (dx k:k+1 ) 

> h ig{y k )(T 2 _ / L k+ i ■ ■ ■ L n - 1 (x 0:k+1 ,X n+1 )g(x k+1 ,y k+1 )fi(dx k+1 ). (A.6) 
Jx 

Since the integrals in (A. 5) and (A.6) are equal, the bound of the lemma follows. □ 

Proof of Proposition A.l. We start with (i). Since, conditional on , the random 
variables /i(£ -n )j 1 — 3 ' — N , are independent and identically distributed with expecta- 
tions 



1 N 



W ■ 



(A.7) 



applying the Marcinkiewicz-Zygmund inequality provides the bound 



1 N "N 1 N N 

E fi(io-.n) ~ E /« (£o.-ra ) 



JV 



<Q>ll/i|| X » +1 ,oo> (A. 



where C p is a universal constant depending only on p. Having control of this discrepancy, 
we focus instead on the L p error associated with the weighted empirical measure (/>^ . n \ n - 
We make use of the identity 

a/b-c= (a/b)(l - b) + a - c 

on each term of the decomposition provided by Lemma A. 2. This, together with 
Minkowski's inequality, gives us the bound 



fk (f%)\\p\g n \/^ 



< 



1 N du N 



+ ll*fc,n[/i]||x*+i 



i N dn N 
N dryf } 



(A.9) 



Applying the Marcinkiewicz-Zygmund inequality to the first term of this bound gives 



1 du 

.7 = 1 k 
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(A.10) 



and treating the second term in a similar manner yields 



N 



N A,,N 



(1// 



iv 

N f-f d?7 

.7 = 1 



d/i 



d// 



iV 



p 

X fc +i,oc 



(A.H) 



Thus, we obtain, by inserting these bounds into (A. 9) and applying Lemmas A. 3 and 
A.4, 



NM(fi)\\ P \g n v^<^p' 



iC l/p p 0V(i-k) ll^fclix2,ooll/i||x" + i, 

M.9(2/fe)(l - P)o- 



(A.12) 



For the first term of the decomposition provided by Lemma (A. 2), we have, using the 
same decomposition technique as in (A. 9) and repeating the arguments of Lemma A.4, 



N\\^(f i )\\ plSn <2C 1 p /i> 



p0\n 



|*0:n[/»]||x,c 



< 



4 £i/p i ll^lkoolLAJ^ 



(A.13) 



vg{yo){l - p) 



Now, (i) follows by a straightforward application of Minkowski's inequality together 
with (A.8), (A.12) and (A.13). 

We turn to (ii). By means of the identity 

a/b-c= (a/b)(l - b) 2 + (a - c)(l - b) + c(l - b) + a - c 

applied to (A. 3), we obtain the bound 

Ie^C/OI&v^]! 



< ll**,n[/<]llx*+i,c 



1 N dn N 

1 \ - a Pk\n , jy,j. 



1 w du^ 



2|e„v^« 1 



i w du^ 

iV ^ dyyf Uo:fc j 



2|S„V^«_ 1 



Thus, we get, by reusing (A. 10) and (A. 11), 
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(A.14) 



iV{ W (^)} 2 (l-p) 2 ^ 



and treating the last term of the decomposition in a completely similar manner yields 



_JWo|lx, oll/illx"+i,oo 



Finally, from (A. 7), we conclude that the multinomial selection mechanism does not 
introduce any additional bias and, consequently, (ii) follows from the triangle inequality, 
together with (A.14) and (A.15). □ 

Having established Proposition A.l, we are now ready to proceed with the proof of 
Theorem 3.3. 

Proof of Theorem 3.3. Decomposing the difference in question yields the bound 

n-l 

ll7^' A " - ln\\p\g n < X!ll^O:fe(A„)|fc(A„) s fe -0i/ ) O:*(A»)|k(A n )Sk|| p |0 n 
k=0 

(A.16) 

n—A„ 

+ ^ l0iAO:fc+A„|A;+A„ s /c - </V,0:n|n s fc|, 
k=0 

where we have set fc(A n ) = (k + A„) A n. By writing 

E[s k (X k , X k+ i)\X k+ A n + l = X k+ A„ + l,Gk+A„] 

= E[E[s fe (Afc, X k+ i)\X k+ i = X k+ i,G k+ A n ]\X k+ A„+l = Xk+A n +l,Qk+A n ] 
= Bv,k+A n \k+A„ ' ' '^v,k+l\k+A n ( x k+A„+l, Sk\k+A n ) 
with, for x € X, 

Sk\k+A n {x) =E[s k (X k ,X k+1 )\X k+1 =x,Qk+A„], 

we get that 

<^,0:fc+A„|fc+A„ s fc — </V,0:n|n s fc 

= ^fc+A„ + l|fe+A„B l/:fc+ A„|fe+A„ ' ' ■Rv,k+l\k+A n ( s k\k+A n ) 
— i J k+A n + l\n^u,k+A„\k+A„ ' ' ' ^i^,k+l\k+A n ( S fc|fc+A„ )i 

where we have defined, for £,m>0, ip£\m = P(-^ £ •|£7m)- Hence, we obtain, using the 
exponential forgetting property (see Theorem 3.1) of the time-reversed conditional hidden 
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chain, 



b v ,0:k+A n \k+A„Sk - 4>v,0:n\nSk\ 
< 2||sfc|fc + A„||x,oo||V'fc+A„+l|fe+A„B lyifc+ A„|fe+A„ ' ' •'&v,k+l\k+A n v) 
- i J k+A„ + l\n^u,k+A„\k+A n ' ' ' ^i/,k+l\k+A n (") ||tV 

<2p A «||s fe || x2>00 . 



(A.17) 



Substituting (A.17) and the bound of Proposition A.l(i) into the decomposition (A. 16) 
completes the proof of (i). The proof of part (ii) is entirely analogous and is omitted for 
brevity. □ 



A. 2. Proof of Proposition 3.4 

(A4) Let fi be the function of Proposition A.l. For p > 2, £ > 1, there exists a constant 
a'" ( ) (/,)6l + such that 



max< E 



\\Wk\\lJ\mU, 



.E[||/ i ||^» + i lOO ];0<fe<n^<aJ3(/ i ). 



Under assumption (A4), we have the following result. 

Proposition A. 5. Assume (Al) and (A2). There then exist universal constants B p and 
B, Bp depending only on p, such that the following holds true for all N > 1; 

(i) if assumption ( A4) is satisfied for I = p > 2 , then 



\\^u,0:n\nfi ~ 0f,O:n|n/i||p 



< 



<N{l-p) 

(ii) if assumption (A4) is satisfied for p = 2 , £=1, then 

I-P 1 



1 - p l n — i p 1 
1 1 7 

cr_(l — p) er_ (dv/dp)- 



l E [<t> v ,0:n\nf% ~ 4>vfl:n\nfi]\ < jy( 1 '_ /3 )2 



o-l(l -p) at (Avjdp)t 



Proof. The proof of the first part is straightforward: combining Proposition A.l and 
Minkowski's inequality provides the bound 



ll0i/,O:n|n/« — </V,0:ra|n/i||p 

= E 1/p [||^ 0:n | ri / i -^ i0: n|n/il 



P 

p\Qn 



< 



Br, 



{pg(Y k )}p 



OV(i-fc) 



Sequential Monte Carlo smoothing 
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(diz/d/*). 



-E 



i//' 



l^o||£,J|/i||£„ +1 



{«7(*b)} p 



We finish the proof by substituting the bounds of assumption (A4) into the expression 
above and summing up. The proof of the second part follows similarly. □ 

Proof of Proposition 3.4. The proof of the first part follows by applying Proposition 
A. 5 and the bound (A. 17) to the decomposition 

n-1 

ll7^' A " - 7n||p < X] II^O:fe(A„)|fc(A„) s fc ~ </V,0:fc(A„)|fc(A„)Sfc||p 



k=0 



n-A„ 



( f'iy,Q:k+A rl \k+A 7l s k — <Pvfi:n\n s k || 



fc=0 

The second part is proved in a similar manner. □ 
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